JSM 2025: Estimating the impact of ambient air pollution on overall mortality in the presence of preferential sampling and measurement error using electronic health record data
Authors
Affiliation
Tae Yoon Lee
Quantitative Sciences Unit, Stanford University School of Medicine
Chloe C. Su
Quantitative Sciences Unit, Stanford University School of Medicine
Summer S. Han
Quantitative Sciences Unit, Stanford University School of Medicine
Published
August 2, 2025
Abstract
A network of monitoring sites is often not well-designed for accurately mapping ambient (outdoor) air pollution due to external factors, such as budget constraints and public opinions. As such, naively using point measurements from the monitoring network can lead to biased mapping. This can have profound downstream implications for environmental health studies when this mapping is used to estimate ambient air pollution exposure at participants’ locations. In this talk, we will address this potential bias due to preferential sampling in the design of a monitoring network for estimating the ambient air pollution field in California. We will utilize a recently developed spatio-temporal statistical framework that simultaneously models the air pollution field and monitoring site selection process. Further, we will examine the downstream implications in quantifying the effects of ambient air pollution on 5-year overall mortality among patients diagnosed with lung cancer using electric health record data from Stanford Health Care, an academic medical center. We will employ a Bayesian model to incorporate the measurement error in the air pollution exposure.
Part II: Outdoor Air Pollution Modeling
This part requires computation. The R version and platform used to generate the results are shown at the end.
Import data and packages
library(INLA)
Loading required package: Matrix
This is INLA_25.06.07 built 2025-06-11 19:15:03 UTC.
- See www.r-inla.org/contact-us for how to get help.
- List available models/likelihoods/etc with inla.list.models()
- Use inla.doc(<NAME>) to access documentation
library(inlabru)
Loading required package: fmesher
library(sf)
Linking to GEOS 3.13.0, GDAL 3.5.3, PROJ 9.5.1; sf_use_s2() is TRUE
library(sp)library(reshape2)library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
The following object is masked from 'package:reshape2':
smiths
The following objects are masked from 'package:Matrix':
expand, pack, unpack
library(stringr)df_border <-read_rds("../data/df_border_scaled.rds")df_border_unscaled <-st_union(read_rds("../data/ca_border.rds"))df_pm25_annual <-read_rds("../data/df_pm25_annual_scaled.rds") %>%group_by(site_number) %>%mutate(n=n()) %>%ungroup() %>%# remove if the site was active only for one yearfilter(n>2) %>%select(-n) %>%mutate(site_number=as.factor(as.numeric(as.factor(site_number))))xlabel <-expression(paste("Annual average PM 2.5 (",mu, "g", "/", m^3, ")",sep=""))
Data Preparation for Modeling
To prepare the data for modeling, we perform the following steps:
Calculate the annual exposure of PM 2.5 for each monitoring site.
Descale the annual exposure by dividing it the overall mean across the years and sites.
Use the logarithm of the descaled annual exposure.
For each site and year, generate the presence indicator \(R\) and its lag to incorporate the retention effect.
Convert year into the 0-1 scale and rename it as time and create an auxiliary variable zero.
Fitting a Bayesian spatio-temporal model can be extremely computionally demanding. Integrated Nested Laplace Approximation utilizes the Stochastic Partial Differentiation Equation (SPDE) to estimate the spatial autocorrelation much more efficiently based on a mesh - a network of discrete sampling locations which are interpolated to approximate a continuous process (e.g., air pollution field) in space. For more details, please see the INLA tutorials.
The size of the mesh plays a crucial role. If it is too big, the fitting process will be too crude. If it too small, the fitting process could be too computationally demanding, and it might cause convergence issues in our study, as there aren’t too many observations. We choose the following mesh by trial-and-error:
# Maybe we should consider set the PC prior using data infospde_obj <-inla.spde2.pcmatern(mesh = mesh, alpha =2, prior.range =c(2.5*edge_in, 0.01),prior.sigma =c(1.5*sqrt(mean(df_var_annual$var_pm)), 0.1),constr = T)
Generating Pseudo-Sites
We add pseudo-sites (i.e., locations that are not monitored) to all vertices in the mesh and create an expanded dataset in order to capture the impact of PS on pseudo-sites. For details, see Watson, Zidek, and Shaddick (2019).
# expanded datasetdf_pm25_flat <-dcast(df_pm25_annual, site_number + north + east ~ year, value.var ="log_annual_mean")utm_crs_km <-st_crs("+proj=utm +zone=10 +ellps=WGS84 +units=km")nodes_locs <-data.frame("N"= mesh$loc[, 2]*10, "E"= mesh$loc[, 1]*10) %>%st_as_sf(coords =c("E", "N"), crs = utm_crs_km)idx_in <-st_contains(df_border_unscaled, nodes_locs) %>%as.matrix() %>%c()sites_locs <-data.frame(east = df_pm25_flat$east,north = df_pm25_flat$north)nodes_in_locs <-data.frame(east = mesh$loc[idx_in, 1], north = mesh$loc[idx_in, 2])pseudo_sites <-setdiff(nodes_in_locs, sites_locs) df_pm25_flat_pseudo <-data.frame(matrix(data=NA, nrow=dim(pseudo_sites)[1], ncol=dim(df_pm25_flat)[2])) %>%`colnames<-`(colnames(df_pm25_flat)) %>%mutate(north = pseudo_sites$north,east = pseudo_sites$east)df_pm25_flat_expand <- df_pm25_flat %>%rbind(df_pm25_flat_pseudo)# Create site number for all pseudo and real sitesdf_pm25_flat_expand$site_number <-1:nrow(df_pm25_flat_expand)ggplot(df_pm25_flat_expand) +gg(mesh) +geom_point(aes(x = east, y = north), color =c(rep("blue", nrow(df_pm25_flat)), rep("green", nrow(df_pm25_flat_pseudo))),size =c(rep(5,nrow(df_pm25_flat)),rep(2, nrow(df_pm25_flat_pseudo))),alpha=c(rep(1,nrow(df_pm25_flat)),rep(0.5, nrow(df_pm25_flat_pseudo)))) +xlab("East / 10km") +ylab("North / 10km") +coord_fixed() +theme_minimal(base_size=20)
# Transform it into long format df_pm25_expand <-melt(df_pm25_flat_expand, id.vars =c(1,2,3), variable.name ='year', value.name ='log_annual_mean')no_sites_expand <-nrow(df_pm25_flat_expand)stopifnot(nrow(df_pm25_expand) == no_sites_expand * num_timepoints)df_pm25_expand <- df_pm25_expand %>%mutate(time =as.numeric(year),time = (time-min(time))/(max(time)-min(time)))df_pm25_expand$locs <-coordinates(df_pm25_expand[, c("east", "north")])df_pm25_expand <-select(df_pm25_expand, !c("east", "north"))df_pm25_expand$site_number <-as.numeric(as.factor(df_pm25_expand$site_number))# Site-selection indicatordf_pm25_expand <- df_pm25_expand %>%mutate(R=as.numeric(!is.na(log_annual_mean)),R_lag=c(rep(NA, no_sites_expand), R[1:(nrow(df_pm25_expand)-no_sites_expand)]),# the auxiliary variablezero =0)
Model Fitting
We will fit three models:
Fitting the site process and air pollution field independently under the assumption of no PS.
Fitting the site process and air pollution field jointly to detect the PS.
Fitting the site process and air pollution field jointly on the expanded pseudo-site dataset to detect the PS AND estimate the air pollution field that accounts for potential effect of PS.
Fitting the site process and the air pollution field independently
Please see the slides for details. It takes less than <1 min to fit the models independently (which may vary across R versions and platforms).
comp_joint_indep <-~# Components for observation modelIntercept_obs(1) +Time_obs_1(time) +Time_obs_2(time^2) +Random_obs_0(site_number, model ="iid2d", n = num_sites*2, constr=TRUE) +Random_obs_1(site_number, weights = time, copy ="Random_obs_0") +Spatial_obs_0(locs, model = spde_obj) +Spatial_obs_1(locs, weights = time, model = spde_obj) +Spatial_obs_2(locs, weights = time^2, model = spde_obj) +# Components for site selection modelIntercept_slc(1) +Time_slc_1(time) +Time_slc_2(time^2) +R_lag_slc(R_lag) +AR_slc(year, model='ar1', hyper=list(theta1=list(prior="pcprec",param=c(2, 0.01)))) +Spatial_slc(locs, model = spde_obj)like_obs <-bru_obs(formula = log_annual_mean ~ Intercept_obs + Time_obs_1 + Time_obs_2 + Random_obs_0 + Random_obs_1 + Spatial_obs_0 + Spatial_obs_1 + Spatial_obs_2,family ="gaussian",data = df_pm25)like_slc <-bru_obs(formula = R ~ Intercept_slc + Time_slc_1 + Time_slc_2 + R_lag_slc + AR_slc + Spatial_slc,family ="binomial",Ntrials =rep(1, times =length(df_pm25$R)),data = df_pm25)bru_options_reset()bru_options_set(bru_max_iter =10,control.inla =list(strategy ="gaussian",int.strategy ='eb'),control.predictor=list(link=1),bru_verbose =3)start.time <-Sys.time()fit_bru_indep <-bru(comp_joint_indep, like_obs, like_slc)end.time <-Sys.time()time.taken.independent <- end.time - start.timetime.taken.independent # 1 minwrite_rds(fit_bru_indep,"../results/bru_indep.rds")
For the joint models, we will exclude the spatial fields that evolve over time due to substantial uncertainty in their estimates: Spatial_obs_1 and Spatial_obs_2.
Joint modeling of the air pollution map and site process on the expanded dataset
It takes less than <5 min to fit the models jointly (which may vary across R versions and platforms).
inla.setOption(fmesher.timeout=120)inla.setOption(inla.timeout=1000)inla.setOption(safe=T)comp_aux_expand <-~# Components for observation modelIntercept_obs(1) +Time_obs_1(time) +Time_obs_2(time^2) +Random_obs_0(site_number, model ="iid2d", n = no_sites_expand*2, constr=TRUE) +Random_obs_1(site_number, weights = time, copy ="Random_obs_0") +Spatial_obs_0(locs, model = spde_obj) +# Components for site selection modelIntercept_slc(1) +Time_slc_1(time) +Time_slc_2(time^2) +R_lag_slc(R_lag) +AR_slc(year, model='ar1', hyper=list(theta1=list(prior="pcprec",param=c(2, 0.01)))) +Spatial_slc(locs, model = spde_obj) +Comp_share1(site_number, copy ='Comp_aux1', fixed =FALSE) +Comp_share2(site_number, copy ='Comp_aux2', fixed =FALSE) +# Components for 1st auxiliary modelRandom_aux1_0(site_number, copy ="Random_obs_0", fixed =TRUE) +Random_aux1_1(site_number, weights = time, copy ="Random_obs_1", fixed =TRUE) +Comp_aux1(site_number, model ='iid', hyper =list(prec =list(initial =-20, fixed=TRUE))) +# Components for 2nd auxiliary modelSpatial_aux2_0(locs, copy ="Spatial_obs_0", fixed =TRUE) +Comp_aux2(site_number, model ='iid', hyper =list(prec =list(initial =-20, fixed=TRUE)))# All likelihoodslike_obs <-bru_obs(formula = log_annual_mean ~ Intercept_obs + Time_obs_1 + Time_obs_2 + Random_obs_0 + Random_obs_1 + Spatial_obs_0, family ="gaussian",data = df_pm25_expand)like_slc_share <-bru_obs(formula = R ~ Intercept_slc + Time_slc_1 + Time_slc_2 + R_lag_slc + AR_slc + Spatial_slc + Comp_share1 + Comp_share2,family ="binomial",Ntrials =rep(1, times =length(df_pm25_expand$R)),data = df_pm25_expand)like_aux_1 <-bru_obs(formula = zero ~ Random_aux1_0 + Random_aux1_1 + Comp_aux1,family ="gaussian",data = df_pm25_expand)like_aux_2 <-bru_obs(formula = zero ~ Spatial_aux2_0 + Comp_aux2,family ="gaussian",data = df_pm25_expand)bru_options_reset()bru_options_set(bru_max_iter =20,control.inla =list(strategy ="auto", int.strategy ='auto'),control.predictor=list(link=1),control.family =list(list(),list(),list(hyper =list(prec =list(initial =30, fixed=TRUE))),list(hyper =list(prec =list(initial =30, fixed=TRUE)))),bru_verbose =3)start_time_expand <-Sys.time()fit_bru_joint_expand <-bru(comp_aux_expand, like_obs, like_slc_share, like_aux_1, like_aux_2)end_time_expand <-Sys.time()runtime_expand <- start_time_expand - end_time_expandruntime_expandwrite_rds(fit_bru_joint_expand,"../results/bru_joint_expand.rds")
Model parameters
We extract the key parameters from all the models fitted.
fit_bru_indep <-read_rds("../results/bru_indep.rds") # the joint model on the original datasetfit_bru_joint <-read_rds("../results/bru_joint.rds") # the joint model on the original datasetfit_bru_joint_expand <-read_rds("../results/bru_joint_expand.rds") # the joint model on the expanded datasetaverage_pm_annual <-read_rds("../results/average_pm_annual.rds")chosen_row_names <-c("Intercept_obs","Time_obs_1","Time_obs_2","Rho1:2 for Random_obs_0","Range for Spatial_obs_0","Stdev for Spatial_obs_0","Intercept_slc","Time_slc_1","Time_slc_2","R_lag_slc","Rho for AR_slc","Precision for AR_slc","Range for Spatial_slc","Stdev for Spatial_slc","Beta for Comp_share1","Beta for Comp_share2")row_names_labels <-c("gamma_0","gamma_1","gamma_2","rho_Z","range_obs","std_obs","alpha_0","alpha_1","alpha_2","alpha_{retention}","rho_R","1/sigma^2_R","range_slc", "std_slc","d_b","d_\beta")df_param_naive <-rbind(fit_bru_indep$summary.fixed %>%select(mean,sd) %>% tibble::rownames_to_column(),fit_bru_indep$summary.hyperpar %>%select(mean,sd) %>% tibble::rownames_to_column()) %>%mutate(type="naive")df_param_expand <-rbind(fit_bru_joint_expand$summary.fixed %>%select(mean,sd) %>%rownames_to_column(),fit_bru_joint_expand$summary.hyperpar %>%select(mean,sd) %>%rownames_to_column()) %>%mutate(type="pseudo-site")df_param_joint <-rbind(fit_bru_joint$summary.fixed %>%select(mean,sd) %>%rownames_to_column(),fit_bru_joint$summary.hyperpar %>%select(mean,sd) %>%rownames_to_column()) %>%mutate(type="presence only")df_param <-rbind(df_param_naive, df_param_expand, df_param_joint)df_chosen_param <- df_param %>%filter(rowname %in% chosen_row_names) %>%mutate(mean =formatC(round(mean,2),format='f',digits=2),sd =formatC(round(sd,2),format='f',digits=2),val =str_c(mean, " (",sd,")")) %>%select(-mean,-sd) %>%pivot_wider(names_from=type,values_from=val)chosen_row_names_obs <-c("Intercept_obs","Time_obs_1","Time_obs_2","Rho1:2 for Random_obs_0","Range for Spatial_obs_0","Stdev for Spatial_obs_0")chosen_row_names_slc <-c("Intercept_slc","Time_slc_1","Time_slc_2","R_lag_slc","Rho for AR_slc","Precision for AR_slc","Range for Spatial_slc","Stdev for Spatial_slc","Beta for Comp_share1","Beta for Comp_share2")row_names_labels_obs <-c("gamma_0","gamma_1","gamma_2","rho_Z","range_obs","std_obs")row_names_labels_slc <-c("alpha_0","alpha_1","alpha_2","alpha_{retention}","rho_R","1/sigma^2_R","range_slc", "std_slc","d_b","d_\beta")df_obs <- df_chosen_param %>%filter(rowname %in% chosen_row_names_obs) %>%mutate(rowname=factor(rowname,levels=chosen_row_names_obs,labels = row_names_labels_obs)) %>%select(rowname,naive,`presence only`,`pseudo-site`)df_slc <- df_chosen_param %>%filter(rowname %in% chosen_row_names_slc) %>%mutate(rowname=factor(rowname,levels=chosen_row_names_slc,labels = row_names_labels_slc)) %>%select(rowname,naive,`presence only`,`pseudo-site`)df_obs
Warning: Combining variables of class <integer> and <factor> was deprecated in ggplot2
3.4.0.
ℹ Please ensure your variables are compatible before plotting (location:
`combine_vars()`)
Warning: Combining variables of class <factor> and <integer> was deprecated in ggplot2
3.4.0.
ℹ Please ensure your variables are compatible before plotting (location:
`combine_vars()`)
vec_min_distance <-read_rds("../results/vec_min_distance.rds")ggplot(data=vec_min_distance,aes(x=distance)) +geom_histogram(linewidth=2,binwidth=5) +xlab("Distance to the nearest monitor (km)") +ylab("Number of patients") +scale_x_continuous(breaks=seq(0,50,by=10)) +theme_minimal(base_size=35)
summary(vec_min_distance$distance)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.07413 5.31238 9.29516 10.60386 14.30765 53.35964
quantile(vec_min_distance$distance,.9)
90%
20.02748
Prediction at patient location
We predict the annual average exposure to PM2.5 for each patient in our cohort at baseline.